I. Introduction

In this project I will explore how, if at all, airine fuel costs drive airline fuel consumption. In theory, I expect time periods with higher fuel costs to associate with lower consumption, and vice versa. The data for this analysis comes from the Bureau of Transportation Statistics. It contains monthly measures of US consumption of airline fuel in millions of gallons and the cost per gallon in dollars. The data spans from 2000 until 2017, though I will split the data from 2016-2017 for a test set to evaluate forecasts. First, I will analyze the consumption and cost time series independently, model them, and evaluate their forecasts. Afterwards, I will create a bivariate VAR model to confirm or deny the hypothesis that fuel costs have predictive power over fuel consumption.


II. Results

Univariate Analysis

(a) Produce a time-series plot of your data including the respective ACF and PACF plots.

From the time series plot we can observe strong seasonality as well as cycles. There does not appear to be a clear trend, though it is mostly downwards. The ACF plot exhibits a slow decay to zero with higher barlett bands at month 12 and lower ones at month 6. The PACF also exhibits some decaying, oscillating behavior. This suggests the series is not a simple AR process, but has an MA component as well. From these plots, I would try testing some S-ARMA models with varying AR and MA orders.

The time series plot of airline cost per gallon exhibits an overall upwards trend, but again is not clear. The ACF plot suggests strong time dependence, and not necessarily any seasonal components. The PACF shows time dependence up to two lags and is otherwise statistically zero. From these information I would guess that an AR(1) or AR(2) model would be appropriate.

(b) Fit a model that includes, trend, seasonality and cyclical components. Make sure to discuss your model in detail.

## Series: cons_ts 
## ARIMA(2,1,2)(2,1,2)[12] 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2    sar1     sar2     sma1    sma2
##       -1.1687  -0.2852  0.9766  0.0284  0.8199  -0.2437  -1.7434  0.9461
## s.e.   0.2336   0.2255  0.2425  0.2386  0.1486   0.1261   0.5088  0.5366
## 
## sigma^2 estimated as 559.6:  log likelihood=-833.76
## AIC=1685.52   AICc=1686.58   BIC=1714.2
## 
## Training set error measures:
##                     ME     RMSE      MAE        MPE     MAPE   MASE
## Training set 0.6712457 22.32526 14.43373 0.07036726 1.486128 0.2654
##                     ACF1
## Training set 0.003015339

I fit several Seasonal ARIMA models to the airline consumption time series and found the lowest BIC score with an ARIMA(2, 1, 2)(2, 1, 2) (and frequency of 12). In total there are eight parameters, two for each AR, MA, S-AR, and S-MA. The magnitudes of each of the parameters are low, as they range from about [-2, 1]. We can note that both AR(1) and AR(2) components have negative signs, while both MA(1) and MA(2) components carry positive signs. The seasonal AR and MA components have different signs with each other from the first and second orders. The standard errors for each of the is low, so we can hope that these parameters are good estimates. The standard deviation is 23.7, which is good with consideration of the scale of the data. The time series plot with the fitted values and observed values shows a good fit, but there are some troughs and peaks that are underpredicted in magnitude.

## Series: cost_ts 
## ARIMA(1,1,0) 
## 
## Coefficients:
##          ar1
##       0.3239
## s.e.  0.0684
## 
## sigma^2 estimated as 0.01435:  log likelihood=134.74
## AIC=-265.48   AICc=-265.42   BIC=-258.98
## 
## Training set error measures:
##                      ME      RMSE        MAE       MPE     MAPE      MASE
## Training set 0.00241706 0.1191595 0.08126944 0.1431169 4.490751 0.9334013
##                     ACF1
## Training set -0.01439193

The cost per gallon model I ended up with was an AR(1). As the time series itself did not exhibit any seasonality, and appears to have high persistence, this did not come at a surprise. The fitted values over the observed values visually indicates a good fit to the data, as this series does not have as complex dynamics as the consumption series. The standard deviation is small at 0.11, along with the other calculated error metrics. The coefficient for the AR(1) component is 0.3239, which indicates a moderate to low level of persistence. Also, note that while the PACF of the series suggested an AR(2) model, the AR(1) model had a lower BIC score.

(c) Plot the respective residuals vs. fitted values and discuss your observations.

Besides a few outliers, the consumption model residuals look consistently centered around zero. It will be easier to evaluate the model performance in looking at the ACF and PACF of the residuals.

The cost model residuals also appear to be somewhat consistently centered around zero. There are some outliers as well as some volatility clustering, but this is at a small scale so I will dismiss this pattern has meaning.

(e) Plot the ACF and PACF of the respective residuals and interpret the plots.

The ACF and PACF of the consumption model residuals resemble a white noise process. So the ARIMA(2, 1, 2)(2, 1, 2) model effectively teased out the seasonality and cyclical nature of the consumption series.

## 
##  Box-Pierce test
## 
## data:  df$cost_residuals
## X-squared = 3.6765, df = 7, p-value = 0.8162

The ACF and PACF plot of the cost model residuals, for the most part, also resemble white noise. There is a significant barlett band at a lag of 11. The Box-Pierce test gives us a X-squared of 3.6765 and a p-value of 0.8162 so we cannot reject the null hypothesis that the data are independently distributed.

(f) Plot the respective CUSUM and interpret the plot.

The consumption model CUSUM plot shows that the cumulative sum of the residuals stays within the confidence interval, so we can conclude that the model does not break at any specific time. There is some low magnitude pattern of a persisent above zero, then zero, then above zero region.

Likewise, the CUSUM plot of the cost model residuals shows the cumulative sum within the confidence interval. So we can also conclude that the cost model does not break at a specific point in time.

(g) Plot the respective Recursive Residuals and interpret the plot.

To further confirm from part f, the recursive residuals plot is consistent about zero, with only two outliers that were not adequately captured by the model. There is no structural break in the pattern of the residuals.

The cost model recursive residuals, like the consumption model, does not appear to break anywhere. The recursive residuals mostly revolve about zero, with the exception of some negative outliers in the middle and one positive outlier at the end.

(h) For your model, discuss the associated diagnostic statistics.

## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## ar1  -1.168707   0.233636 -5.0023 5.666e-07 ***
## ar2  -0.285160   0.225538 -1.2644  0.206104    
## ma1   0.976561   0.242537  4.0264 5.663e-05 ***
## ma2   0.028406   0.238630  0.1190  0.905247    
## sar1  0.819940   0.148635  5.5165 3.459e-08 ***
## sar2 -0.243710   0.126118 -1.9324  0.053311 .  
## sma1 -1.743412   0.508843 -3.4262  0.000612 ***
## sma2  0.946072   0.536620  1.7630  0.077897 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Aside from the statistics already discussed in part b, we can still look at the significance of the parameters and some other tests. From the coefficient test, we note that only four of the estimated parameters are statistically significant. Those being the AR(1), MA(1), S-AR(1), and S-MA(1) terms. So the second period lag coefficients are determined to not contain statistical significance and the consumption time series is a short-term dynamics process.

## [1] "Skewness: -1.33766418965007"
## [1] "Kurtosis: 16.1370353249231"

We can also look more closely at the distribution of the residuals. The residuals are not normally distributed, they have a slight left skew of -1.34 and a much higher kurtosis at 16.14. So the ARIMA model could use improvement with respect to extreme values.

## 
## z test of coefficients:
## 
##     Estimate Std. Error z value Pr(>|z|)    
## ar1 0.323861   0.068396  4.7351 2.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The cost model only had one parameter, AR(1), which is statistically significant.

## [1] "Skewness: -1.04493542875793"
## [1] "Kurtosis: 8.28320265232361"

Similar to the consumption model residuals, the cost model residuals are also slightly skewed to the left and have a higher than Normal kurtosis. Both the series have more outliers in the data than a normal distribution.

(i) Use your model to forecast 12-steps ahead. Your forecast should include the respective error bands.

## [1] "RMSE: 72.7847102735522"

Take a visual note of how the ARIMA model forecasts with cyclical and seasonal behavior. It appears that this model is forecasting a period of higher fuel consumption, and then a period of lower consumption. We can also calculate the RMSE score between the saved test set of observed values from 2016 to mid 2017 and the forecasted values This model gave a RMSE of 72.78 over these 20 periods. This will serve as a comparison point for the multivariate analysis.

## [1] "RMSE: 0.197792765644361"

Since our model for the cost series was an AR(1), it really only has predictive power at one-step-ahead. So we see a fast convergence to the unconditional mean. The RMSE for the forecasts versus the observed values from 2016 to mid 2017 is 0.198. This will serve as a comparison point for the multivariate analysis.

VAR Analysis

(i) Fit an appropriate VAR model using your two variables. Make sure to show the relevant plots and discuss your results from the fit.

## 
## VAR Estimation Results:
## ========================= 
## Endogenous variables: cons_ts, cost_ts 
## Deterministic variables: const 
## Sample size: 190 
## Log Likelihood: -732.351 
## Roots of the characteristic polynomial:
## 0.968 0.968 0.3147 0.3109
## Call:
## VAR(y = var_df, p = 2, season = 12L)
## 
## 
## Estimation results for equation cons_ts: 
## ======================================== 
## cons_ts = cons_ts.l1 + cost_ts.l1 + cons_ts.l2 + cost_ts.l2 + const + sd1 + sd2 + sd3 + sd4 + sd5 + sd6 + sd7 + sd8 + sd9 + sd10 + sd11 
## 
##              Estimate Std. Error t value Pr(>|t|)    
## cons_ts.l1    0.63914    0.07172   8.912 6.63e-16 ***
## cost_ts.l1   13.99464   15.25978   0.917  0.36036    
## cons_ts.l2    0.29027    0.07062   4.110 6.08e-05 ***
## cost_ts.l2  -21.09596   15.53287  -1.358  0.17617    
## const        82.46964   27.56823   2.991  0.00318 ** 
## sd1         -60.05362   11.05383  -5.433 1.85e-07 ***
## sd2        -117.64151    9.12559 -12.891  < 2e-16 ***
## sd3          66.84705    9.31056   7.180 1.95e-11 ***
## sd4         -37.41183   15.05966  -2.484  0.01393 *  
## sd5         -20.74247    9.05357  -2.291  0.02316 *  
## sd6          -5.70953   10.31865  -0.553  0.58075    
## sd7          12.56759   10.08250   1.246  0.21427    
## sd8         -29.29400   10.87607  -2.693  0.00776 ** 
## sd9        -159.34991    9.58363 -16.627  < 2e-16 ***
## sd10        -27.17109   10.72204  -2.534  0.01215 *  
## sd11        -63.24689   11.00607  -5.747 4.00e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 25.27 on 174 degrees of freedom
## Multiple R-Squared: 0.9592,  Adjusted R-squared: 0.9556 
## F-statistic: 272.5 on 15 and 174 DF,  p-value: < 2.2e-16 
## 
## 
## Estimation results for equation cost_ts: 
## ======================================== 
## cost_ts = cons_ts.l1 + cost_ts.l1 + cons_ts.l2 + cost_ts.l2 + const + sd1 + sd2 + sd3 + sd4 + sd5 + sd6 + sd7 + sd8 + sd9 + sd10 + sd11 
## 
##              Estimate Std. Error t value Pr(>|t|)    
## cons_ts.l1  0.0001869  0.0003398   0.550   0.5830    
## cost_ts.l1  1.3002675  0.0722953  17.985  < 2e-16 ***
## cons_ts.l2 -0.0001040  0.0003346  -0.311   0.7563    
## cost_ts.l2 -0.3082072  0.0735892  -4.188 4.46e-05 ***
## const      -0.0642898  0.1306084  -0.492   0.6232    
## sd1         0.0626358  0.0523691   1.196   0.2333    
## sd2         0.0991206  0.0432338   2.293   0.0231 *  
## sd3         0.0715509  0.0441101   1.622   0.1066    
## sd4         0.0500074  0.0713473   0.701   0.4843    
## sd5         0.0704769  0.0428926   1.643   0.1022    
## sd6         0.0125026  0.0488861   0.256   0.7984    
## sd7         0.0708070  0.0477673   1.482   0.1401    
## sd8         0.0491498  0.0515269   0.954   0.3415    
## sd9         0.0362744  0.0454038   0.799   0.4254    
## sd10        0.0673874  0.0507972   1.327   0.1864    
## sd11       -0.0160705  0.0521428  -0.308   0.7583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## Residual standard error: 0.1197 on 174 degrees of freedom
## Multiple R-Squared: 0.9836,  Adjusted R-squared: 0.9822 
## F-statistic: 696.6 on 15 and 174 DF,  p-value: < 2.2e-16 
## 
## 
## 
## Covariance matrix of residuals:
##          cons_ts  cost_ts
## cons_ts 638.4600 -0.20369
## cost_ts  -0.2037  0.01433
## 
## Correlation matrix of residuals:
##          cons_ts  cost_ts
## cons_ts  1.00000 -0.06734
## cost_ts -0.06734  1.00000

From the summary of our VAR models we have equations for the consumption time series and the cost time series. Because the purpose of the assignment was to analyze whether or not fuel costs affected fuel consumption, only the first model is relevant to us. We can see from the coefficient estimates that, like the consumption model, there are significant AR(1) and AR(2) components. However, contrary to my hypothesis, the lagged cost variables do not have statistical significance. This came as a surprise to me, as I believed from an economic standpoint that the supply and demand forces would play into the interaction between these two phenomena. Additionally, we see seasonal significance for all months except for June and July (depending on the confidence level chosen). Here, the residual standared error is 25.27, which is slightly higher than the univariate model’s 23.7. So as of now, it appears that the consumption model performs better on itself than inclusion of the cost series. The correlation matrix indicates a weak negative correlation between the variables. This is what I expected, because one would think higher prices would lead to less consumption and vice versa, however the magnitude is much smaller than I theorized.

As for the cost model, only lagged values of itself are statistically significant. This comes at no surprise intuitively. There is also one significant seasonal variable for February. The residual standard error is 0.1197, which is almost identical to the univariate cost model. So, we can conclude that seasonailty is not a strong force in the cost time series and that the inclusion of the consumption series did not have much of an influence on cost modeling.

From the diagrams above, the overall fit to the data looks similar to the univariate model’s fit. The residuals look mostly like white noise, besides the outlier in the beginning and one significant spike in the PACF at lag 9. Likewise, the multivariate cost diagrams do not differ significantly from the univariate model diagrams. Both model residuals for the cost series show significant barlett bands at lags of 11 in the ACF and PACF that were not adequately captured.

Another plot to consider is the Cross Correlation Function. As we saw from the correlation matrix, there is an obvious negative relationship between fuel consmption and cost. Both to the right and left of zero there is a slow decay to zero, though the right side exhibits some slight cyclical pattern and overall stronger signals. Note that the scale of the plot maxes out around -0.6, so these correlations are very strong, explaining how cost is not a primary driver of consumption.

(j) Compute, plot, and interpret the respective impulse response functions.

The IRF plot from consumption indicates a moderate immediate response, and then a slow decay to zero afterwards from itself. And the cost response from consumption is essentially zero. As discussed earlier, we do not expect consumption to have any driving power over cost. The IRF from cost however, gets a small response from consumption, which then slowly decays below zero. So changes in cost do in fact have an inverted relationship with consumption, but not of a great magnitude.

(k) Perform a Granger-Causality test on your variables and discuss your results from the test.

## Granger causality test
## 
## Model 1: cons_ts ~ Lags(cons_ts, 1:2) + Lags(cost_ts, 1:2)
## Model 2: cons_ts ~ Lags(cons_ts, 1:2)
##   Res.Df Df      F   Pr(>F)   
## 1    185                      
## 2    187 -2 5.0474 0.007339 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Granger causality test
## 
## Model 1: cost_ts ~ Lags(cost_ts, 1:2) + Lags(cons_ts, 1:2)
## Model 2: cost_ts ~ Lags(cost_ts, 1:2)
##   Res.Df Df      F Pr(>F)
## 1    185                 
## 2    187 -2 0.3475 0.7069

The Granger-Causality test of cost on consumption gives us a p-value of 0.007, so depending on the chosen confidence level we can conclude that cost does Granger-Cause consumption. But as we have seen before, the cost per gallon does not necessarily drive consumption patterns, and only has a negative relationship.

The Granger-Causality test of consumption on cost gives a p-value of 0.71, so we can conclude that consumption does not Granger-Cause costs.

(l) Use your VAR model to forecast 12-steps ahead. Your forecast should include the respective error bands. Comment on the differences between the two forecasts (VAR vs. ARMA).

## [1] "RMSE Consumption: 76.010971103499"
## [1] "RMSE Cost: 0.139559662241529"

The ultimate goal of the project was to forecast airline fuel consumption, and to see whether or not adding fuel cost per gallon to a model could help in this process. At a visual perspective, the multivariate consumption model forecasts follow a similar pattern to the univariate forecasts. The seasonal and cyclical nature are both accounted for. However, the RMSE was 76.01, which is slightly higher than the univariate model’s of 72.78. For the cost series, the forecast of the multivariate model better incorporates seasonality, and reflects this in its RMSE of 0.14 as compared to the univariate model’s of 0.197.


III. Conclusions and Future Work

The purpose of this project was to explore whether or not airline fuel consumption and fuel cost follow a traditional economic model of supply and demand. My overall hypothesis was that fuel consumption and fuel cost per gallon would have a strong negative correlation, and thus fuel costs would be effective in predicting future consumption. The applications of a study like this would be for any economic process where aviation transportation plays a large part. So this could be things like high speed international trade and tourism. Ideally, airlines can forecast the demand for air travel and how much fuel will be needed over the course of time.

From the analysis above, the main discovery was that airline fuel consumption and airline fuel cost per gallon are in fact negatively associated with each other, but not at a strong enough magnitude to have any predictive power. We compared the univariate consumption and cost models to the multivariate VAR models. For consumption, I found the univarite model with AR and MA components to be more performant than the VAR model that included a cost time series. On the other hand, the VAR cost model performed better than the univariate one. These two comparisons were carried out by computing an RMSE score for each model on a dedicated test set of data from 2016 to mid 2017.

There are undoubtedly many things that could improve a study like this. First, and foremost, more domain knowledge on the airline industry and the process of transportation is absolutely necessary. There are many questions that I am sure I did not even know to ask about the relationship between fuel consumption and costs. Second, more variables could be added in to the VAR modelling component. A process such as airline fuel consumption is a highly dynamic system and I imagine many different forces come into play. From an economic standpoint, air travel could be thought of as a necessary good in some cases. So a different in fuel price may not make an impact of flight demand as people may need to travel regardless. The last thing that be improved up in the future is a deeper knowledge of the R packages vars and forecast that I used to fit models. Because the models contained both AR, MA, and seasonal components, I was not sure how to hard code in AR and MA components with additive seasonal dummy variables. To highlight this constraint, we saw that the VAR cost model performed better than the univariate model in terms of RMSE even though we know that the cost series does not exhibit seasonality. So with that, a more advancing knowledge of model building would be beneficial in having the ability to fine tune models for this particular data.


IV. References

“Airline Fuel Cost and Consumption (U.S. Carriers - Scheduled).” Bureau of Transportation Statistics, www.transtats.bts.gov/fuel.asp.


V. R Source Code

# Set global rmarkdown settings
knitr::opts_chunk$set(root.dir='/Users/noahkawasaki/Desktop/ECON 144/Week 9', 
                      echo=FALSE, 
                      warning=FALSE, 
                      message=FALSE, 
                      fig.width=9, fig.height=5, fig.align="center")

# Load libraries
library(tidyverse)
library(tseries)
library(forecast)
library(strucchange)
library(vars)
library(lmtest)
library(car)
library(Metrics)
library(stats)
library(moments)


# Read in data and select needed variables
data <- read_csv("/Users/noahkawasaki/Desktop/ECON 144/Week 9/BTS-FUEL.csv") %>%
  dplyr::select(1, 2, 4) %>%
  set_names("date", "consumption", "cost") 

# Reverse dataframe from early to latest dates, separate train and test sets
df <- data[seq(dim(data)[1],1),][1:192, ]
test_df <- data[seq(dim(data)[1],1),][192:211, ] # 2016 - July 2017

# Make ts objects
cons_ts <- ts(df$consumption)
cost_ts <- ts(df$cost)


## (a)
# Consumption
ggplot(df, aes(x=date, y=consumption)) +
  geom_line(color="#3796db", lwd=0.8) +
  ggtitle("US Consumption of Airline Fuel", "2000-2015") +
  xlab("Date") +
  ylab("Gallons (millions)")
  
par(mfrow=c(2, 1))
acf(df$consumption, main="ACF - Consumption", lag.max=36)
pacf(df$consumption, main="PACF - Consumption", lag.max=36)

# Cost
ggplot(df, aes(x=date, y=cost)) +
  geom_line(color="orange", lwd=0.8) +
  ggtitle("US Cost per Gallon of Airline Fuel", "2000-2015") +
  xlab("Date") +
  ylab("Dollars")

par(mfrow=c(2,1))
acf(df$cost, main="ACF - Cost", lag.max=36)
pacf(df$cost, main="PACF - Cost", lag.max=36)


## (b)
# Create Models
cons_model <- Arima(cons_ts, order=c(2,1,2), seasonal=list(order=c(2,1,2), period=12))
cost_model <- Arima(cost_ts, order=c(1,1,0))

# Add model attributes as data to df for plotting purposes
df <- df %>%
  mutate(
    consumption_fitted = cons_model$fitted,
    consumption_residuals = cons_model$residuals,
    cost_fitted = cost_model$fitted,
    cost_residuals = cost_model$residuals
  )

# Consumption
cons_cols <- c("Fitted Values"="black", "Observed Values"="#3796db")  # Named vector for legend mapping
ggplot(df, aes(x=date, y=consumption)) +
  geom_line(aes(color="Observed Values"), lwd=0.8) +
  geom_line(aes(y=consumption_fitted, color="Fitted Values")) +
  ggtitle("US Consumption of Airline Fuel", "2000-2015") +
  xlab("Date") +
  ylab("Gallons (millions)") +
  scale_color_manual("Legend", values=cons_cols)

summary(cons_model)

# Cost
cost_cols <- c("Fitted Values"="black", "Observed Values"="orange")  # Named vector for legend mapping
ggplot(df, aes(x=date, y=cost)) +
  geom_line(aes(color="Observed Values"), lwd=0.8) +
  geom_line(aes(y=cost_fitted, color="Fitted Values")) +
  ggtitle("US Cost per Gallon of Airline Fuel", "2000-2015") +
  xlab("Date") +
  ylab("Dollars") +
  scale_color_manual("Legend", values=cost_cols)

summary(cost_model)


## (c)
# Consumption
ggplot(df, aes(x=consumption_fitted, y=consumption_residuals)) +
  geom_line(color='green', alpha=0.9) +
  geom_point(color='green', alpha=0.9) +
  geom_hline(yintercept=mean(df$consumption_residuals), color='black', linetype='dashed') +
  ggtitle("Residuals vs Fitted Values", "Consumption Model") +
  xlab("Fitted Values") +
  ylab("Residuals")

# Cost
ggplot(df, aes(x=cost_fitted, y=cost_residuals)) +
  geom_line(color='green', alpha=0.9) +
  geom_point(color='green', alpha=0.9) +
  geom_hline(yintercept=mean(df$cost_residuals), color='black', linetype='dashed') +
  ggtitle("Residuals vs Fitted Values", "Cost Model") +
  xlab("Fitted Values") +
  ylab("Residuals")


## (e)
# Consumption
par(mfrow=c(2,1))
acf(df$consumption_residuals, main="ACF - Consumption Model Residuals")
pacf(df$consumption_residuals, main="PACF - Consumption Model Residuals")

# Cost
par(mfrow=c(2,1))
acf(df$cost_residuals, main="ACF - Cost Model Residuals")
pacf(df$cost_residuals, main="PACF - Cost Model Residuals")

Box.test(df$cost_residuals, lag=(11-2-2))  # 11 lags - 2 AR - 2 MA


## (f)
# Consumption
plot(efp(df$consumption_residuals~1, type="Rec-CUSUM"), main="Consumption Model CUSUM Test")

# Cost
plot(efp(df$cost_residuals~1, type="Rec-CUSUM"), main="Cost Model CUSUM Test")


## (g)
# Consumption
plot(recresid(df$consumption_residuals~1), pch=16, main="Consumption Model Recursive Residuals",
     ylab="Residuals")

# Cost
plot(recresid(df$cost_residuals~1), pch=16, main="Consumption Model Recursive Residuals",
     ylab="Residuals")


## (h)
# Consumption
coeftest(cons_model)  # Statistical Significance of parameters

ggplot(df) +
  geom_qq(aes(sample=consumption_residuals), color='black', alpha=0.6) +
  ggtitle('QQ Normal Plot of Residuals', 'Consumption Model') +
  ylab('Sample Quantiles') +
  xlab('Theoretical Quantiles')

print(paste("Skewness:", skewness(df$consumption_residuals)))
print(paste("Kurtosis:", kurtosis(df$consumption_residuals)))

# Cost
coeftest(cost_model)  # Statistical Significance of parameters

ggplot(df) +
  geom_qq(aes(sample=cost_residuals), color='black', alpha=0.6) +
  ggtitle('QQ Normal Plot of Residuals', 'Cost Model') +
  ylab('Sample Quantiles') +
  xlab('Theoretical Quantiles')

print(paste("Skewness:", skewness(df$cost_residuals)))
print(paste("Kurtosis:", kurtosis(df$cost_residuals)))


## (i)
# Consumption
cons_forecasts <- forecast(cons_model, h=20)
plot(cons_forecasts)

print(paste("RMSE:", rmse(test_df$consumption, cons_forecasts$mean)))

# Cost
cost_forecasts <- forecast(cost_model, h=20)
plot(cost_forecasts)

print(paste("RMSE:", rmse(test_df$cost, cost_forecasts$mean)))


## (i)
# Model
var_df <- data.frame(cbind(cons_ts, cost_ts))  # VAR takes df
var <- VAR(var_df, p=2, season=12)
summary(var)

plot(var)

# CCF
ccf(cons_ts, cost_ts, main="CCF - Consumption and Cost")


## (j)
plot(irf(var))


## (k)
grangertest(cons_ts ~ cost_ts, order=2)
grangertest(cost_ts ~ cons_ts, order=2)


## (l)
var_forecasts = predict(object=var, n.ahead=20)
plot(var_forecasts)

# Access forecast vectors from var_forecasts object
var_cons_forecasts <- var_forecasts[[1]][[1]][,1]
var_cost_forecasts <- var_forecasts[[1]][[2]][,1]

# RMSE
print(paste("RMSE Consumption:", rmse(test_df$consumption, var_cons_forecasts)))
print(paste("RMSE Cost:", rmse(test_df$cost, var_cost_forecasts)))